Chapter 5 - New Developments: Topic Modeling with BERTopic!#

2022 July 30


What is BERTopic?#

  • As part of NLP analysis, it’s likely that at some point you will be asked, “What topics are most common in these documents?”

    • Though related, this question is definitely distinct from a query like “What words or phrases are most common in this corpus?”

      • For example, the sentences “I enjoy learning to code.” and “Educating myself on new computer programming techniques makes me happy!” have no tokens in common, yet they express a similar sentiment.

      • If possible, we would like to extract generalized topics instead of specific words/phrases to get an idea of what a document is about.

  • This is where BERTopic comes in! BERTopic is a topic modeling technique that combines transformer-based embeddings (the family of models that includes BERT) with other ML tools, such as dimensionality reduction, clustering, and a class-based variant of TF-IDF, to provide a flexible and powerful topic modeling module (with great visualization support as well!).

  • In this notebook, we’ll go through the operation of BERTopic’s key functionalities and present resources for further exploration.
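  • To make the words-versus-topics distinction concrete, here is a minimal sketch that measures the token overlap between the two example sentences above. The tokenize helper is a simple illustrative function, not part of BERTopic:

```python
import re

def tokenize(text):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))

sentence_a = "I enjoy learning to code."
sentence_b = "Educating myself on new computer programming techniques makes me happy!"

tokens_a = tokenize(sentence_a)
tokens_b = tokenize(sentence_b)

# Tokens that appear in both sentences, and the Jaccard overlap of the two token sets
shared = tokens_a & tokens_b
jaccard = len(shared) / len(tokens_a | tokens_b)

print("Shared tokens:", shared)
print("Jaccard overlap:", jaccard)
```

  • A purely word-based comparison finds nothing in common between the two sentences, which is why topic models that work on embeddings of whole documents are useful here.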

Required installs:#

# Installs the base bertopic module:
!pip install bertopic 

# If you want to use other transformers/language backends, it may require additional installs: 
# !pip install bertopic[flair] # can substitute 'flair' with 'gensim', 'spacy', 'use'

# bertopic also comes with its own handy visualization suite: 
# !pip install bertopic[visualization]

Data sourcing#

  • For this exercise, we’re going to use a popular data set, ‘20 Newsgroups,’ which contains ~18,000 newsgroup posts on 20 topics. This dataset is readily available to us through Scikit-Learn:

import bertopic
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

documents = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

print(documents[0]) # Any ice hockey fans? 
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!

Creating a BERTopic model:#

  • To use the BERTopic module, you first create an instance of the model. When doing so, you can specify several parameters, including:

    • language -> the language of your documents

    • min_topic_size -> the minimum size of a topic; increasing this value will lead to fewer topics

    • embedding_model -> the model used to embed your documents; many are supported!
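  • As a rough sketch of how these parameters are passed, the snippet below keeps a few settings in a dictionary and builds the model from them. The values and the build_model helper are hypothetical and chosen only for illustration; bertopic is imported inside the helper so that the settings can be inspected even where the package is not installed:

```python
# Hypothetical settings chosen for illustration; tune them for your own corpus.
model_settings = {
    "language": "english",   # language of the documents
    "min_topic_size": 20,    # larger values produce fewer, broader topics
}

def build_model(**overrides):
    """Create a BERTopic instance from the settings above, with optional overrides."""
    from bertopic import BERTopic  # imported lazily; requires `pip install bertopic`
    return BERTopic(**{**model_settings, **overrides})
```

  • For example, build_model(min_topic_size=50) would ask for larger topics, and build_model(embedding_model="all-MiniLM-L6-v2") would name a sentence-transformers model explicitly.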

Example instantiation:#

from sklearn.feature_extraction.text import CountVectorizer 

# example parameter: a custom vectorizer controls how topic terms are extracted;
# this one drops English stopwords and allows one- and two-word terms:
stopwords_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english') 

# instantiating the model: 
model = BERTopic(vectorizer_model=stopwords_vectorizer)

Fitting the model:#

  • The first step of topic modeling is to fit the model to the documents:

# Uncomment the line below
# topics, probs = model.fit_transform(documents)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
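  • The warning above names its own remedy: set the TOKENIZERS_PARALLELISM environment variable. A minimal sketch, assuming you are happy to disable tokenizer parallelism, is to set it at the top of the notebook before fitting the model:

```python
import os

# Set this before running anything that uses Hugging Face tokenizers,
# so the fork warning is not triggered during fitting.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```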
  • .fit_transform() returns two outputs:

    • topics contains mappings of inputs (documents) to their modeled topic (alternatively, cluster)

    • probs contains the probability that each input belongs to its assigned topic

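  • Note: fit() can be used in place of fit_transform(). Both fit the model to the documents; fit() returns the fitted model itself, while fit_transform() also returns the topic assignments and probabilities. To assign topics to new, unseen documents, call transform() on the fitted model.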
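  • To show how the two outputs can be used together, here is a small sketch with toy stand-ins for topics and probs. The values and the 0.7 threshold are made up for illustration; real values come from the fitted model:

```python
from collections import Counter

# Toy stand-ins for the two outputs of fit_transform(): one entry per document.
topics = [0, 1, -1, 0, 2, -1, 0]
probs = [0.91, 0.67, 0.0, 0.88, 0.54, 0.0, 0.73]

# Number of documents assigned to each topic
topic_sizes = Counter(topics)

# Share of documents the model left unassigned (topic -1)
outlier_share = topic_sizes[-1] / len(topics)

# Indices of documents assigned to a real topic with probability of at least 0.7
confident_docs = [
    i for i, (topic, prob) in enumerate(zip(topics, probs))
    if topic != -1 and prob >= 0.7
]

print(topic_sizes)
print(confident_docs)
```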

Viewing topic modeling results:#

  • The BERTopic module has many built-in methods to view and analyze your fitted model topics. Here are some basics:

# view your topics: 
topics_info = model.get_topic_info()

# get detailed information about the top five most common topics: 
print(topics_info.head(5))
   Topic  Count                                       Name
0     -1   6646                     -1_file_use_need_using
1      0   1838                0_team_games_players_season
2      1    616              1_clipper_encryption_chip_nsa
3      2    527  2_cheek ken_ken huh_ignore art_huh ignore
4      3    452          3_israel_israeli_jews_palestinian
  • When examining topic information, you may see a topic with the assigned number ‘-1.’ Topic -1 collects the outliers, i.e. the inputs that were not assigned to any topic, and should typically be ignored during analysis.

  • Forcing documents into a topic could decrease the quality of the topics generated, so it’s usually a good idea to allow the model to discard inputs into this ‘Topic -1’ bin.
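  • A minimal sketch of setting the outlier bin aside, using the five rows from the get_topic_info() output above as plain tuples. On the real DataFrame, the same filter can be written with a boolean mask, as noted in the comment:

```python
# Rows copied from the get_topic_info() output above: (Topic, Count, Name).
topic_rows = [
    (-1, 6646, "-1_file_use_need_using"),
    (0, 1838, "0_team_games_players_season"),
    (1, 616, "1_clipper_encryption_chip_nsa"),
    (2, 527, "2_cheek ken_ken huh_ignore art_huh ignore"),
    (3, 452, "3_israel_israeli_jews_palestinian"),
]

# Keep only real topics; topic -1 is the outlier bin.
assigned_rows = [row for row in topic_rows if row[0] != -1]

# Number of documents that ended up in the outlier bin
outlier_count = sum(count for topic, count, _ in topic_rows if topic == -1)

# pandas equivalent on the real DataFrame:
# topics_info[topics_info["Topic"] != -1]
print(len(assigned_rows), outlier_count)
```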

# access a single topic: 
print(model.get_topic(topic=0)) # .get_topics() accesses all topics
[('team', 0.007645058778587724), ('games', 0.006112662299637617), ('players', 0.005412026399964582), ('season', 0.005342811826876292), ('hockey', 0.005239065199444112), ('league', 0.004280045353200042), ('teams', 0.003990602953367509), ('baseball', 0.0037812052034601833), ('nhl', 0.003514144827427642), ('gm', 0.0029900018153221084)]
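  • The output of get_topic() is a list of (term, score) tuples sorted from highest to lowest score. As a small sketch, the first few terms can be joined into a readable label; the list below copies the first five entries of the output above:

```python
# First five entries copied from the get_topic(0) output above.
topic_words = [
    ("team", 0.007645058778587724),
    ("games", 0.006112662299637617),
    ("players", 0.005412026399964582),
    ("season", 0.005342811826876292),
    ("hockey", 0.005239065199444112),
]

# Build a short label from the three highest-scoring terms
top_terms = [word for word, _ in topic_words[:3]]
label = ", ".join(top_terms)

print(label)  # team, games, players
```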
# get representative documents for a specific topic: 
print(model.get_representative_docs(topic=0)) # omit the 'topic' parameter to get docs for all topics 
["\ni have no idea, nor do i care.  however, i'd like to point out that\nblomberg got the first plate appearance by a designated hitter, and\nthe first walk by a designated hitter.  i am not sure, but i do not\nthink that he also got the first hit by a designated hitter.", ": >\n: >ATLANTIC DIVISION\n: >\t\n: >\tST JOHN'S MAPLE LEAFS VS MONCTON HAWKS\n: >\tMONCTON HAWKS\n: >See CD Islanders. Moncton is a very similar team to CDI. Low scoring,\n: >defensive, good goaltending. John Leblanc and Stu Barnes are the only\n: >noticable guns on the team. But the defense is top notch and \n: >Mike O'Neill is the most underrated goalie in the league.\n: >\n\n: Bri, as I have tried to tell you since 2 February, Michael O'Neill\n: might be the most underrated goalie in the AHL, but he ISN'T in the\n: AHL.  He's on the Winnipeg Jets' injury list, as he has been since\n: his first NHL start against the Ottawa Senators.  He's out until\n: next year after surgery to repair a shoulder separation.\n\n: Stu Barnes might be an AHL gun for the Hawks, but he's now the third\n: line center with the Jets, and has been since mid January or so.\n\nSorry, my memory is gone. I thought that O'Neill got sent back\ndown in February but I must have been given incorrect info. I guess\nthis says it all about Moncton because Barnes is still one of\ntheir top 3 or so scorers even though he's been out since January.", "\n\nI didn't see any smilies in this message so.......\n\n                W     T    L    PTs\n   Team A      50    30    4    104\n   Team B      52    32    0    104\n\n\nThere you go.  Two teams that tie in points without identical records.\n\n"]
# find topics similar to a key term/phrase: 
similar_topics, similarity_scores = model.find_topics("sports", top_n=5)
print("Most similar topics: " + str(similar_topics)) # view the numbers of the top-5 most similar topics

# print the top terms of each of the most similar topics
for topic_num in similar_topics: 
    print('\nContents from topic number: '+ str(topic_num) + '\n')
    print(model.get_topic(topic_num))

Most similar topics: [0, 30, 6, 166, 4]

Contents from topic number: 0

[('team', 0.007645058778587724), ('games', 0.006112662299637617), ('players', 0.005412026399964582), ('season', 0.005342811826876292), ('hockey', 0.005239065199444112), ('league', 0.004280045353200042), ('teams', 0.003990602953367509), ('baseball', 0.0037812052034601833), ('nhl', 0.003514144827427642), ('gm', 0.0029900018153221084)]

Contents from topic number: 30

[('games', 0.03260548961663573), ('sega', 0.02366315012814771), ('arcade', 0.012166539858844822), ('snes', 0.010883627526511617), ('sega genesis', 0.01081910740506706), ('joysticks', 0.010294764495945618), ('games sale', 0.010085068481475858), ('sale', 0.00964091677280479), ('joystick', 0.009006639792149954), ('sega cd', 0.0074012373591723)]

Contents from topic number: 6

[('riding', 0.011792240692170709), ('ride', 0.011256591323418531), ('driving', 0.007418204752466058), ('road', 0.007362304673149508), ('traffic', 0.006971330162717447), ('roads', 0.005093305390738552), ('bikes', 0.0046328368271995445), ('bikers', 0.0041220512073587194), ('riders', 0.0037367046265679754), ('passengers', 0.0035386604055364823)]

Contents from topic number: 166

[('religion', 0.024810151190057972), ('war', 0.01958713595572545), ('wars', 0.0141305144151792), ('crusades', 0.012827683749926261), ('history', 0.01202363443416338), ('religious', 0.009458363539211138), ('unbelievers', 0.008338773663764506), ('yoked unbelievers', 0.007970064155940823), ('statement religion', 0.007495172035922859), ('gods', 0.0071255212864334274)]

Contents from topic number: 4

[('health', 0.0072259305085357), ('cancer', 0.005975505039095839), ('disease', 0.00513078203584376), ('tobacco', 0.005069613472607038), ('medical', 0.00492433353954727), ('hiv', 0.004709304265420622), ('malaria', 0.004112010029452724), ('smokeless tobacco', 0.004033769948845448), ('lyme', 0.003923377448522405), ('medical newsletter', 0.003903230753928965)]
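  • Conceptually, find_topics() embeds the search term and compares that embedding with each topic's embedding. The sketch below illustrates the idea with cosine similarity on toy three-dimensional vectors. The topic names and numbers are invented for illustration and are not taken from the fitted model:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings"; real embeddings have hundreds of dimensions.
topic_embeddings = {
    "sports": [0.9, 0.1, 0.0],
    "video games": [0.6, 0.7, 0.1],
    "medicine": [0.0, 0.2, 0.9],
}
query_embedding = [1.0, 0.2, 0.0]

# Rank topics from most to least similar to the query
ranked = sorted(
    topic_embeddings,
    key=lambda name: cosine_similarity(query_embedding, topic_embeddings[name]),
    reverse=True,
)

print(ranked)
```

  • This also explains why loosely related topics can appear in the results: the ranking always returns the closest topics available, even when the closest ones are not very close.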

Saving/loading models:#

  • One of the most obvious drawbacks of BERTopic is its long run-time. Rather than re-fitting the model every time you want to conduct topic modeling analysis, you can save a fitted model and load it again later!

# save your model: 
# model.save("TAML_ex_model")
# load it later: 
# loaded_model = BERTopic.load("TAML_ex_model")
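  • One way to combine the two calls is a "load if saved, otherwise fit and save" helper. This is a sketch under the assumption that the model is saved to a single local path; load_or_fit is a hypothetical helper name, and bertopic is imported inside it so that defining the function does not require the package:

```python
from pathlib import Path

def load_or_fit(model_path, documents):
    """Load a saved BERTopic model if one exists; otherwise fit a new one and save it."""
    from bertopic import BERTopic  # imported lazily; requires `pip install bertopic`

    if Path(model_path).exists():
        return BERTopic.load(model_path)

    model = BERTopic()
    model.fit(documents)      # the slow step, run only when no saved model is found
    model.save(model_path)
    return model
```

  • With this helper, model = load_or_fit("TAML_ex_model", documents) would only pay the fitting cost on the first run.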

Visualizing topics:#

  • Although the prior methods can be used to manually examine the textual contents of topics, visualizations can be an excellent way to succinctly communicate the same information.

  • Depending on the visualization, it can even reveal patterns that would be hard or impossible to see through textual analysis - like inter-topic distance!

  • Let’s see some examples!

# Create a 2D representation of your modeled topics & their pairwise distances: 
model.visualize_topics()
# Get the top words and their c-TF-IDF scores for the top topics, but in bar chart form! 
model.visualize_barchart()
# Evaluate topic similarity through a heat map: 
model.visualize_heatmap()
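  • These methods return Plotly figures, so outside a notebook they can be written to a standalone HTML file for sharing. A minimal sketch, where export_topic_map and the default file name are hypothetical:

```python
def export_topic_map(model, output_path="topic_map.html"):
    """Save the inter-topic distance map of a fitted BERTopic model as an HTML file."""
    figure = model.visualize_topics()   # Plotly figure of topics and their distances
    figure.write_html(output_path)      # interactive file that opens in any browser
    return output_path
```

  • Calling export_topic_map(model) on the fitted model would then produce a file that can be opened without a running notebook.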

Conclusion#